Skip to content

ARROW-17524: [C++] Correction for fields included when reading an ORC table#13962

Merged
pitrou merged 16 commits into
apache:masterfrom
LouisClt:CorrectionForOrcIncludeIndices
Oct 18, 2022
Merged

ARROW-17524: [C++] Correction for fields included when reading an ORC table#13962
pitrou merged 16 commits into
apache:masterfrom
LouisClt:CorrectionForOrcIncludeIndices

Conversation

@LouisClt

Copy link
Copy Markdown
Contributor

I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here :
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

@github-actions

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou

pitrou commented Aug 25, 2022

Copy link
Copy Markdown
Member

@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above?

@LouisClt LouisClt changed the title Correction for fields included when reading an ORC table ARROW-17524: [C++] Correction for fields included when reading an ORC table Aug 25, 2022
@LouisClt

Copy link
Copy Markdown
Contributor Author

@pitrou Yes I can.
Here is the new Jira https://issues.apache.org/jira/browse/ARROW-17524

@LouisClt LouisClt marked this pull request as ready for review August 25, 2022 13:50
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@LouisClt

Copy link
Copy Markdown
Contributor Author

C Glib tests do not pass because it test the old behaviour.
Looking a bit more in detail at the difference between "include" and "includeTypes", it appears that "includeTypes" is based on indices of the tables types. The root of the tree (meaning the whole table) is type of index 0, and when there are complex types such as structs would give other "child" types. See https://orc.apache.org/docs/types.html for more information.
"include" select the fields in the same way than the other imports : each table has a certain number of fields, (we do not take into account the "children" types), if we want to select the first field, we put "0" in the list.

I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable.

@LouisClt

Copy link
Copy Markdown
Contributor Author

So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated.
There is a decision to be made whether or not the intended behaviour of SelectFields is the previous one or this one.

Comment thread c_glib/test/test-orc-file-reader.rb Outdated
Comment thread ruby/red-arrow/test/test-orc.rb Outdated
Comment thread cpp/src/arrow/adapters/orc/adapter.cc
@pitrou

pitrou commented Oct 5, 2022

Copy link
Copy Markdown
Member

@LouisClt Feel free to say when this is ready for another review.

@LouisClt LouisClt requested a review from pitrou October 6, 2022 07:29
@LouisClt

LouisClt commented Oct 6, 2022

Copy link
Copy Markdown
Contributor Author

Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related.

std::vector<int> selected_indices = {1, 3};
AssertTableWriteReadEqual(table, table_selected, kDefaultSmallMemStreamSize / 16,
&selected_indices);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test but:

  • can we also test with non-empty data?
  • can we test selecting a field that's after the struct (to ensure field numbering is as expected)?

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @LouisClt . This looks good to me now.

@pitrou

pitrou commented Oct 18, 2022

Copy link
Copy Markdown
Member

CI failures are unrelated.

@pitrou pitrou merged commit d35bf87 into apache:master Oct 18, 2022
@LouisClt

Copy link
Copy Markdown
Contributor Author

Good to know, thanks @pitrou

kou pushed a commit that referenced this pull request Oct 20, 2022
… table (#13962)

I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here : 
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and 
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

Lead-authored-by: LouisClt <louis1110@hotmail.fr>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants